Project 1.1: A RISC-V Assembler (Individual Project)

Computer Architecture I ShanghaiTech University

Project 1.1

IMPORTANT INFO - PLEASE READ

The projects are part of your design project, which is worth 2 credit points. As such, they run in parallel to the actual course, so be aware that the due dates for the project and the homework might be very close to each other! Start early and do not procrastinate.

Introduction to Project 1.1

In Project 1.1, you are going to write a simple one-pass RISC-V assembler. The assembler takes RISC-V assembly code that contains no labels or symbols as input and outputs the corresponding machine code. You also need to implement basic error handling to detect invalid instructions. You can fetch the framework for Project 1.1 here on GitHub Classroom; try to use git and GitHub for version control.

Background of The Instruction Set

Registers

Please consult the RISC-V Green Sheet (PDF) for register numbers, instruction opcodes, and bitwise formats. Our assembler will support all 32 registers: zero, ra, sp, gp, tp, t0-t6, s0-s11, a0-a7. The numeric register names (e.g. x0, x1, x2) shall also be supported. Note that floating-point registers are not included in this project.
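As a minimal sketch of how register names could be mapped to numbers, the function below accepts both the ABI names and the x0-x31 forms. The name reg_number and the return convention (-1 for a bad register) are assumptions made for illustration; they are not part of the framework.

```c
#include <stdlib.h>
#include <string.h>

/* ABI names indexed by register number (x0..x31). */
static const char *const ABI_NAMES[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6", "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8", "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

/* Hypothetical helper: returns 0..31 for a valid register name, -1 otherwise. */
int reg_number(const char *name)
{
    int i;

    if (name == NULL) {
        return -1;
    }
    /* First try the ABI names (zero, ra, sp, ...). */
    for (i = 0; i < 32; i++) {
        if (strcmp(name, ABI_NAMES[i]) == 0) {
            return i;
        }
    }
    /* Then try the numeric form: 'x' followed by decimal digits only. */
    if (name[0] == 'x' && name[1] >= '0' && name[1] <= '9') {
        char *end;
        long n = strtol(name + 1, &end, 10);
        if (*end == '\0' && n >= 0 && n <= 31) {
            return (int)n;
        }
    }
    return -1;
}
```

With this sketch, reg_number("a0") gives 10 and reg_number("x32") gives -1. It happens to accept x01 as register 1, which is one of the cases you are not required to handle.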

Instructions

We will have 42 instructions and 6 pseudo-instructions to assemble. The instructions are:

Instruction	Type	Opcode	Funct3	Funct7/IMM	Operation
add rd, rs1, rs2	R	0x33	0x0	0x00	R[rd] ← R[rs1] + R[rs2]
mul rd, rs1, rs2			0x0	0x01	R[rd] ← (R[rs1] * R[rs2])[31:0]
sub rd, rs1, rs2			0x0	0x20	R[rd] ← R[rs1] - R[rs2]
sll rd, rs1, rs2			0x1	0x00	R[rd] ← R[rs1] << R[rs2]
mulh rd, rs1, rs2			0x1	0x01	R[rd] ← (R[rs1] * R[rs2])[63:32]
slt rd, rs1, rs2			0x2	0x00	R[rd] ← (R[rs1] < R[rs2]) ? 1 : 0
sltu rd, rs1, rs2			0x3	0x00	R[rd] ← (U(R[rs1]) < U(R[rs2])) ? 1 : 0
xor rd, rs1, rs2			0x4	0x00	R[rd] ← R[rs1] ^ R[rs2]
div rd, rs1, rs2			0x4	0x01	R[rd] ← R[rs1] / R[rs2]
srl rd, rs1, rs2			0x5	0x00	R[rd] ← R[rs1] >> R[rs2] (logical shift)
sra rd, rs1, rs2			0x5	0x20	R[rd] ← R[rs1] >> R[rs2] (arithmetic shift)
or rd, rs1, rs2			0x6	0x00	R[rd] ← R[rs1] | R[rs2]
rem rd, rs1, rs2			0x6	0x01	R[rd] ← R[rs1] % R[rs2]
and rd, rs1, rs2			0x7	0x00	R[rd] ← R[rs1] & R[rs2]
lb rd, offset(rs1)	I	0x03	0x0		R[rd] ← SignExt(Mem(R[rs1] + offset, byte))
lh rd, offset(rs1)			0x1		R[rd] ← SignExt(Mem(R[rs1] + offset, half))
lw rd, offset(rs1)			0x2		R[rd] ← Mem(R[rs1] + offset, word)
lbu rd, offset(rs1)			0x4		R[rd] ← U(Mem(R[rs1] + offset, byte))
lhu rd, offset(rs1)			0x5		R[rd] ← U(Mem(R[rs1] + offset, half))
addi rd, rs1, imm		0x13	0x0		R[rd] ← R[rs1] + imm
slli rd, rs1, imm			0x1	0x00	R[rd] ← R[rs1] << imm
slti rd, rs1, imm			0x2		R[rd] ← (R[rs1] < imm) ? 1 : 0
sltiu rd, rs1, imm			0x3		R[rd] ← (U(R[rs1]) < U(imm)) ? 1 : 0
xori rd, rs1, imm			0x4		R[rd] ← R[rs1] ^ imm
srli rd, rs1, imm			0x5	0x00	R[rd] ← R[rs1] >> imm (logical shift)
srai rd, rs1, imm			0x5	0x20	R[rd] ← R[rs1] >> imm (arithmetic shift)
ori rd, rs1, imm			0x6		R[rd] ← R[rs1] | imm
andi rd, rs1, imm			0x7		R[rd] ← R[rs1] & imm
jalr rd, rs1, imm		0x67	0x0		R[rd] ← PC + 4; PC ← R[rs1] + imm
ecall		0x73	0x0	0x000	(Transfers control to operating system) a0 = 1 is print value of a1 as an integer. a0 = 4 is print the string at address a1. a0 = 10 is exit or end of code indicator. a0 = 11 is print value of a1 as a character.
sb rs2, offset(rs1)	S	0x23	0x0		Mem(R[rs1] + offset) ← R[rs2][7:0]
sh rs2, offset(rs1)			0x1		Mem(R[rs1] + offset) ← R[rs2][15:0]
sw rs2, offset(rs1)			0x2		Mem(R[rs1] + offset) ← R[rs2]
beq rs1, rs2, offset	SB	0x63	0x0		if(R[rs1] == R[rs2]) PC ← PC + {offset, 1b'0}
bne rs1, rs2, offset			0x1		if(R[rs1] != R[rs2]) PC ← PC + {offset, 1b'0}
blt rs1, rs2, offset			0x4		if(R[rs1] < R[rs2]) PC ← PC + {offset, 1b'0}
bge rs1, rs2, offset			0x5		if(R[rs1] >= R[rs2]) PC ← PC + {offset, 1b'0}
bltu rs1, rs2, offset			0x6		if(U(R[rs1]) < U(R[rs2])) PC ← PC + {offset, 1b'0}
bgeu rs1, rs2, offset			0x7		if(U(R[rs1]) >= U(R[rs2])) PC ← PC + {offset, 1b'0}
auipc rd, offset	U	0x17			R[rd] ← PC + {offset, 12b'0}
lui rd, offset	U	0x37			R[rd] ← {offset, 12b'0}
jal rd, imm	UJ	0x6f			R[rd] ← PC + 4; PC ← PC + {imm, 1b'0}

NOTE: Since our assembler is a one-pass assembler without label support, the offset in SB-type and U-type instructions and the imm in UJ-type instructions will always be given as integers.

The pseudo-instructions are:

Pseudo-instruction	Format	Uses
Branch on Equal to Zero	beqz rs1, label	beq
Branch on not Equal to Zero	bnez rs1, label	bne
Jump	j label	jal
Jump Register	jr rs1	jalr
Load Immediate	li rd, immediate	lui, addi
Move	mv rd, rs1	addi
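The five pseudo-instructions that map to a single real instruction can be handled by rewriting the token list before encoding. The sketch below uses the conventional RISC-V expansions (beqz rs1 off becomes beq rs1 x0 off, jr rs1 becomes jalr x0 rs1 0, and so on) and assumes the label operand is an integer offset, as the NOTE above states. The function name expand_simple_pseudo and its token-array interface are hypothetical; li is left out because it may need two instructions.

```c
#include <string.h>

/* Hypothetical helper: rewrites a single-instruction pseudo-instruction.
 * `in` holds n tokens; `out` must have room for 4 tokens.
 * Returns the number of tokens written, or 0 if in[0] is not one of
 * beqz, bnez, j, jr, mv with the expected operand count. */
int expand_simple_pseudo(const char *in[], int n, const char *out[])
{
    if (n == 3 && strcmp(in[0], "beqz") == 0) {   /* beq rs1 x0 offset */
        out[0] = "beq"; out[1] = in[1]; out[2] = "x0"; out[3] = in[2];
        return 4;
    }
    if (n == 3 && strcmp(in[0], "bnez") == 0) {   /* bne rs1 x0 offset */
        out[0] = "bne"; out[1] = in[1]; out[2] = "x0"; out[3] = in[2];
        return 4;
    }
    if (n == 2 && strcmp(in[0], "j") == 0) {      /* jal x0 offset */
        out[0] = "jal"; out[1] = "x0"; out[2] = in[1];
        return 3;
    }
    if (n == 2 && strcmp(in[0], "jr") == 0) {     /* jalr x0 rs1 0 */
        out[0] = "jalr"; out[1] = "x0"; out[2] = in[1]; out[3] = "0";
        return 4;
    }
    if (n == 3 && strcmp(in[0], "mv") == 0) {     /* addi rd rs1 0 */
        out[0] = "addi"; out[1] = in[1]; out[2] = in[2]; out[3] = "0";
        return 4;
    }
    return 0;
}
```

Rewriting tokens first means the rest of the assembler only has to validate and encode real instructions, so register and immediate errors in a pseudo-instruction are caught by the same checks.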

For further reference, here are the bit lengths of the instruction components.

R-TYPE	funct7	rs2	rs1	funct3	rd	opcode
Bits	7	5	5	3	5	7

I-TYPE	imm[11:0]	rs1	funct3	rd	opcode
Bits	12	5	3	5	7

S-TYPE	imm[11:5]	rs2	rs1	funct3	imm[4:0]	opcode
Bits	7	5	5	3	5	7

SB-TYPE	imm[12]	imm[10:5]	rs2	rs1	funct3	imm[4:1]	imm[11]	opcode
Bits	1	6	5	5	3	4	1	7

U-TYPE	imm[31:12]	rd	opcode
Bits	20	5	7

UJ-TYPE	imm[20]	imm[10:1]	imm[11]	imm[19:12]	rd	opcode
Bits	1	10	1	8	5	7
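The field layouts above translate directly into shifts and masks. The following is a sketch only: the function names are hypothetical, and the SB and UJ encoders assume the integer operand is a byte offset whose bit 0 is dropped (which matches the 21-bit jal range given in the Tips). Compare against Venus and test/test.ref before relying on that assumption.

```c
#include <stdint.h>

/* R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode */
uint32_t encode_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
uint32_t encode_i(int32_t imm, uint32_t rs1, uint32_t funct3,
                  uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return ((u & 0xfffu) << 20) | (rs1 << 15) | (funct3 << 12) |
           (rd << 7) | opcode;
}

/* S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode */
uint32_t encode_s(int32_t imm, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 5) & 0x7fu) << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | ((u & 0x1fu) << 7) | opcode;
}

/* SB-type: imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode */
uint32_t encode_sb(int32_t offset, uint32_t rs2, uint32_t rs1,
                   uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)offset;   /* assumed to be a byte offset */
    return (((u >> 12) & 0x1u) << 31) | (((u >> 5) & 0x3fu) << 25) |
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((u >> 1) & 0xfu) << 8) | (((u >> 11) & 0x1u) << 7) | opcode;
}

/* U-type: imm[31:12] | rd | opcode (imm20 is the 20-bit operand itself) */
uint32_t encode_u(uint32_t imm20, uint32_t rd, uint32_t opcode)
{
    return ((imm20 & 0xfffffu) << 12) | (rd << 7) | opcode;
}

/* UJ-type: imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode */
uint32_t encode_uj(int32_t offset, uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)offset;   /* assumed to be a byte offset */
    return (((u >> 20) & 0x1u) << 31) | (((u >> 1) & 0x3ffu) << 21) |
           (((u >> 11) & 0x1u) << 20) | (((u >> 12) & 0xffu) << 12) |
           (rd << 7) | opcode;
}
```

For example, add x1 x2 x3 is encode_r(0x00, 3, 2, 0x0, 1, 0x33), which yields 0x003100b3, and sw x5 8(x2) is encode_s(8, 5, 2, 0x2, 0x23), which yields 0x00512423. Casting the signed immediate to uint32_t before masking keeps the shifts well defined for negative values.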

Getting Started

File Structure and Usage

The directory tree of the framework should look like the following:


    .
    ├── inc
    │   ├── assembler.h
    │   └── util.h
    ├── Makefile
    ├── main.c
    ├── src
    │   ├── assembler.c
    │   └── util.c
    └── test
        ├── test.ref
        └── test.S

main.c is the entry point of the whole assembler. You should not modify this file.

assembler.c and assembler.h are where you implement the assembler function.

util.c and util.h contain some helper functions. You can also add useful functions there.

The test directory contains a basic test and the corresponding result.

Build & Execute

Run make to compile the code; the assembler executable will be named main.
Alternatively, you can build the code with CMake. First create a directory named build, then run cmake .. && make inside build. The executable will be build/main.
To run the assembler, type main input_file output_file. input_file contains RISC-V instructions (see below for a detailed description). output_file is where your results are written.
Run make test to test your code with test/test.S; your output file will be test/test.out.

Input & Output

Input

Input will be a file containing RISC-V instructions. You can assume there are no empty lines or comments and that each line ends with a \n. We will use a space as the delimiter instead of a comma, e.g. add x1 x2 x3.
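Because every line is space-delimited, a line can be split in place with strtok. The sketch below is one possible approach, not the framework's method: split_line is a hypothetical name, and treating the parentheses as delimiters too is a design choice so that lw a0 0(a1) yields the four tokens lw, a0, 0 and a1.

```c
#include <string.h>

/* Hypothetical helper: splits `line` in place and stores up to max_tokens
 * pointers into `tokens`. Spaces, the trailing newline and the parentheses
 * of load/store operands are all treated as delimiters.
 * Returns the number of tokens found. */
int split_line(char *line, char *tokens[], int max_tokens)
{
    int n = 0;
    char *tok = strtok(line, " ()\n");

    while (tok != NULL && n < max_tokens) {
        tokens[n++] = tok;
        tok = strtok(NULL, " ()\n");
    }
    return n;
}
```

Note that strtok modifies the buffer it is given, so pass it the writable line buffer you read the input into (for example with fgets), never a string literal.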

Output

Output should be RISC-V machine code. You should use the function dump_code in src/util.c when outputting machine code. This function requires a file handle and a uint32_t variable as parameters, which should be the output file and the code to be dumped. Do not use your own output function; otherwise, there may be format problems. Also, do not change the output format in dump_code, since we will use your util.c when grading.

Error Handling

If the input file contains illegal instructions, you should detect them and write error information to the output file. You should use the function dump_error_information in src/util.c for outputting error information. Once an error occurs, you should continue to assemble the remaining instructions and keep outputting results and errors. Also, you should not terminate the whole program directly using exit. Quitting unexpectedly will be treated as a runtime error.

To simplify the error handling part, we promise that there will be exactly one space between adjacent tokens. Also, you do not need to handle cases where an instruction has too many or too few operands, like addi a0 a1 or addi a0 a0 a0 1. Load/store instructions will always be in the correct format, e.g. lw a0 0(a1), but the correctness of the registers and the offset is not guaranteed.

Here are situations you need to consider in this project:

Non-existent instruction: All supported instructions are listed above, and any other instruction should be treated as illegal.
Bad registers: Wrong register names or register numbers that are out of range should be detected, e.g. rp, x32. You don't need to handle cases like x01 and a-1.
Bad immediate or offset: The imm or offset in an instruction may not be a number, e.g. addi a0 a1 a0.
Immediate out of range: The immediate in some instructions is restricted to a certain range, since the number of bits available to represent imm is limited. For example, imm in addi should be between -2048 and 2047. You can refer to Venus and the RISC-V manual for more information about these limits.
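The "bad immediate" and "out of range" checks can be combined in one helper built on strtol. This is a sketch under assumptions: parse_imm is a hypothetical name, and it only accepts decimal input because the handout does not say whether hexadecimal immediates must be supported.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parses `text` as a decimal integer and checks that
 * lo <= value <= hi. Returns 1 and stores the value in *out on success,
 * or 0 if the text is not a number or the value is out of range. */
int parse_imm(const char *text, long lo, long hi, long *out)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0') {
        return 0;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return 0;                /* no digits, or trailing junk like "12x" */
    }
    if (errno == ERANGE) {
        return 0;                /* does not even fit in a long */
    }
    if (value < lo || value > hi) {
        return 0;                /* valid number, but not encodable */
    }
    *out = value;
    return 1;
}
```

For addi you would call parse_imm(token, -2048, 2047, &imm). Checking the end pointer is what distinguishes addi a0 a1 a0 (not a number) from a well-formed but oversized immediate; both return 0 here, so split the return codes if your error output needs to tell them apart.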

Testing

Diff

Use diff file1 file2 to compare your output with the reference answer. Note that we will use diff to check your answer. To see how to interpret diff results, click here

Valgrind

To check for memory leaks, you can use Valgrind by running valgrind --tool=memcheck --leak-check=full --track-origins=yes main input_file output_file

Venus

Venus is a powerful assembler and you can use Venus to test the correctness of your code.

First type RISC-V instructions on the editor page. Then, on the simulator page, you can see the machine code of each instruction. You can also use the Dump button to collect all the machine code as a reference.

Tips

Immediate in auipc and lui should be between 0 and 1048575. Venus treats this immediate as an unsigned integer by default, while the official manual does not mention this. We choose to follow Venus. For auipc, since the starting address of text is smaller than that of data, PC-relative addresses are always larger than the current PC, resulting in a non-negative offset. For lui, the immediate is placed in the upper 20 bits of the register, so its sign does not matter.
Immediate in jal should be between -1048576 and 1048575. For some reason, Venus does not limit this immediate, even though immediates outside this range cannot be represented. However, we are going to follow the hardware limitation.
This project needs a lot of string splitting. You may find strtok useful.
You need to check whether the immediate in the li instruction is between -2048 and 2047. If so, li should be translated into a single addi instruction. Otherwise, it will be translated into lui and addi.
Try to generate your own test cases. Code needs testing.
Don't forget to write comments frequently.
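The li rule in the tips can be sketched as follows. The split shown is the conventional one: because addi sign-extends its 12-bit immediate, the upper part is rounded by adding 0x800 before shifting. The name expand_li and its interface are hypothetical, and you should confirm against Venus that the reference output uses the same split.

```c
#include <stdint.h>

/* Hypothetical helper: decides how "li rd imm" expands.
 * Returns 1 if a single "addi rd x0 lower12" is enough, or 2 if the
 * expansion is "lui rd upper20" followed by "addi rd rd lower12". */
int expand_li(int32_t imm, uint32_t *upper20, int32_t *lower12)
{
    uint32_t u = (uint32_t)imm;

    if (imm >= -2048 && imm <= 2047) {
        *upper20 = 0;
        *lower12 = imm;
        return 1;
    }
    /* Round the upper part so that the sign-extended lower part adds back
     * to the original value (modulo 2^32). */
    *upper20 = ((u + 0x800u) >> 12) & 0xfffffu;
    /* Sign-extend the low 12 bits into the range -2048..2047. */
    *lower12 = (int32_t)(u & 0xfffu);
    if (*lower12 >= 2048) {
        *lower12 -= 4096;
    }
    return 2;
}
```

For example, li a0 2048 expands to lui with 1 and addi with -2048, since 4096 - 2048 = 2048. The resulting upper20 always lies between 0 and 1048575, which fits the unsigned lui range described in the first tip.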

Submission

You should submit your code via GitHub. Please follow the guidance on Gradescope to submit your code on GitHub. Note that we will not use your main.c or Makefile for grading. The compilation flags will be -Wpedantic -Wall -Wextra -Wvla -Werror -std=c11.